Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback about the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style: it is 100% hands-on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of an R Markdown Notebook with concepts, comments, instructions, and blank coding spaces that you will fill in by coding along with the instructor. Other teaching materials include a live-updating HTML version of the notebook and datasets to import into R when required. This learning approach will let you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available through DataCamp (see the syllabus for the grading scheme and percentages of the final mark) to help cement and/or extend what you learn each week.
We’ll take a blank-slate approach to R here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from some potential scenarios such as…
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don’t know what to do with.
Maybe you’re manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You’re generating high-throughput data and there aren’t any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don’t know what that means or where to start.
and get you to a point where you can…
Format your data correctly for analysis.
Produce basic plots and perform exploratory analysis.
Make functions and scripts for re-analysing existing or new data sets.
Track your experiments in a digital notebook like R Markdown!
In the first lesson, we will talk about the basic data structures and objects in R, get cozy with the R Markdown Notebook environment, and learn how to get help when you are stuck, because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), and then subset and merge data. After that, we will dig into the data and learn how to make basic plots for both exploratory data analysis and publication. We’ll follow that up with data cleaning and string manipulation; this is really the battleground of coding - getting your data into just the right format where you can analyse it more easily. We’ll then spend a lecture digging into the functions available for the statistical analysis of your data. Lastly, we will learn about control flow and how to write customized functions, which can really save you time and help scale up your analyses.
Don’t forget, the structure of the class is a code-along style: it is fully hands-on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module, so you don’t have to spend your attention taking notes.
There is no single correct path from A to B, although some paths may be more elegant or more efficient than others. With that in mind, the emphasis in this lecture series will be on the tidyverse series of packages. This resource is well maintained by a large community of developers. While not always the “fastest” approach, this additional layer can help ensure your code still runs (somewhat) smoothly later down the road.

This is the fifth in a series of seven lectures. Last lecture we
learned the basics of data visualization with the ggplot2
package. This week we return to the world of data cleaning with a very
important tool - the regular expression! At the end of this session you
will be able to use tidyverse tools and regular expressions
to tidy/clean your data. This week our topics centre on regular expressions and the stringr package.

Grey background: Command-line code, R library, and function names. Backticks are also used for in-line code.
... fill in the code here if you are coding along
Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green box: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2024-09-IntroR/Lecture_XX. You will find a
partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R Markdown Notebook
(.Rmd) and data files from the RStudio server to your
personal computer if you would like to work independently of the
University of Toronto datatools Hub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF or HTML file under the Modules section of Quercus.
Today we have 3 data files to help us work through the concepts of data cleaning with regular expressions.
This is an example file for us to start playing with the idea of regular expressions.
This is the main file that we’ll be working with for the rest of the lecture. We’ll search, replace, and manipulate data from this file after importing it into our notebooks.
We’ll return to this metadata towards the end of lecture but it holds all of the experimental condition information that we’ve been going over all term.
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you,
like dplyr, readr, readxl,
tibble, and ggplot2). In particular, we will be
taking advantage of the stringr package this week.

Some of these packages should already be installed in your Anaconda
base environment from previous lectures. If not, please review that lesson and load
these packages. Remember to please install these packages from the
conda-forge channel of Anaconda.
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
#--------- Load packages for today's session ----------#
library(tidyverse)
In previous weeks the data cleaning we’ve worked with has been more in the realm of data management - moving columns, converting from wide to long format, and making new variables. We’ve been more focused on getting the data into a proper format for analysis. Aside from splitting multi-variable columns apart, however, we have done very little to alter the raw data values and headings themselves.
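As a quick refresher, that kind of data-management step can be sketched with tidyr's pivot_longer() and a mutate(). The tibble below is invented for illustration, not one of this week's data files:

```r
library(tidyverse)

# Hypothetical wide-format data: one measurement column per day
wide <- tibble(
  sample = c("A", "B"),
  day_1  = c(0.12, 0.08),
  day_2  = c(0.45, 0.30)
)

# Convert from wide to long format, then derive a numeric 'day' variable
long <- wide %>%
  pivot_longer(cols = starts_with("day_"),
               names_to = "day", values_to = "od600") %>%
  mutate(day = as.integer(str_remove(day, "day_")))

long
```

Each sample now contributes one row per day, which is the tidy shape most plotting and modelling functions expect.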
Why do we need to do this?
‘Raw’ data is seldom (read: nearly never) in a usable format. Data in tutorials or demos have already been meticulously filtered, transformed, and readied to showcase that specific analysis. How many people have done a tutorial only to find they can’t get their own data into the right format for the tool they just spent an hour learning about?
Data cleaning requires us to fix or remove incorrect, incomplete, or inconsistently formatted values. Some definitions might take this a bit further and include normalizing data and removing outliers. In this course, we consider data cleaning as getting data into a format where we can start actively exploring our data with graphics, data normalization, etc.
In our previous lectures we focused on how to transform data into a
tidy format using functions from the dplyr and
tidyr packages. One thing we have avoided addressing so far is
cleaning up problematic column names and values. We’ve worked with some
specific <tidy-select> functions that give us a hint
about what regular expressions are doing, but we haven’t taken advantage
of the true power of regular expressions.
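For instance, the <tidy-select> helper matches() already accepts a full regular expression. A minimal sketch, using made-up column names rather than this week's data:

```r
library(tidyverse)

# Made-up expression table with two replicate columns
df <- tibble(
  gene_id   = c("g1", "g2"),
  expr_rep1 = c(1.2, 3.4),
  expr_rep2 = c(0.8, 2.9)
)

# matches() takes a regular expression: keep only the replicate columns
reps <- df %>% select(matches("^expr_rep[0-9]+$"))
names(reps)
# "expr_rep1" "expr_rep2"
```

The pattern "^expr_rep[0-9]+$" anchors the whole name, so a column like expr_rep_mean would not sneak in.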
Today we are going to focus mostly on cleaning text. This step is crucial for taking control of your dataset and your metadata. Many would say that the prelude to transforming data is actually the grunt work of cleaning it.
I have included the functions I find most useful for these tasks but I encourage you to take a look at the Strings Chapter in R for Data Science for an exhaustive list of functions. So let’s get to it!
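As a small taste of what is ahead, here is a sketch of a few stringr workhorses applied to some invented, inconsistently formatted sample labels (the base |> pipe assumes R >= 4.1):

```r
library(stringr)

# Invented sample labels with stray whitespace, mixed case, mixed separators
messy <- c("Sample_01 ", " sample-02", "SAMPLE 03")

clean <- messy |>
  str_trim() |>                   # drop leading/trailing whitespace
  str_to_lower() |>               # standardize case
  str_replace("[\\s-]+", "_")     # unify spaces/hyphens to underscores

clean
# "sample_01" "sample_02" "sample_03"
```

Three one-line operations turn three differently mangled labels into one consistent format - exactly the kind of repetitive fix that regular expressions make painless.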